Prosper Loan Data Exploration by Nicole Schoen

Univariate Plots Section

##  Term_f                     LoanStatus      ClosedDate        
##  12: 1613   Current              :56576   Min.   :2009-09-09  
##  36:58510   Completed            :19498   1st Qu.:2012-06-19  
##  60:24545   Chargedoff           : 5326   Median :2013-02-25  
##             Defaulted            :  996   Mean   :2012-12-25  
##             Past Due (1-15 days) :  806   3rd Qu.:2013-09-12  
##             Past Due (31-60 days):  363   Max.   :2014-03-10  
##             (Other)              : 1103   NA's   :58848       
##  LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber
##  Min.   :   0.00           Min.   : 1.00                
##  1st Qu.:   0.00           1st Qu.: 9.00                
##  Median :   0.00           Median :13.00                
##  Mean   :  36.52           Mean   :14.46                
##  3rd Qu.:   0.00           3rd Qu.:19.00                
##  Max.   :1593.00           Max.   :41.00                
##                            NA's   :78442                
##  LoanMonthsSinceOrigination LoanOriginationDate  LoanOriginationQuarter
##  Min.   : 0.00              Min.   :2009-08-11   Q4 2013:14449         
##  1st Qu.: 4.00              1st Qu.:2012-02-27   Q1 2014:12172         
##  Median :11.00              Median :2013-04-09   Q3 2013: 9180         
##  Mean   :15.94              Mean   :2012-11-18   Q2 2013: 7099         
##  3rd Qu.:25.00              3rd Qu.:2013-11-05   Q3 2012: 5632         
##  Max.   :55.00              Max.   :2014-03-12   Q2 2012: 5061         
##                                                  (Other):31075         
##  MonthlyLoanPayment LoanOriginalAmount PercentFunded      Investors     
##  Min.   :   0.0     Min.   : 1000      Min.   :0.7000   Min.   :   1.0  
##  1st Qu.: 157.6     1st Qu.: 4000      1st Qu.:1.0000   1st Qu.:   1.0  
##  Median : 252.2     Median : 7500      Median :1.0000   Median :  31.0  
##  Mean   : 292.3     Mean   : 9094      Mean   :0.9981   Mean   :  68.1  
##  3rd Qu.: 389.0     3rd Qu.:13750      3rd Qu.:1.0000   3rd Qu.:  97.0  
##  Max.   :2251.5     Max.   :35000      Max.   :1.0125   Max.   :1189.0  
##                                                                         
##            listingCategory 
##  Home Improvement  :53084  
##  Other             : 9178  
##  Business          : 6783  
##  Personal Loan     : 5285  
##  Auto              : 2230  
##  Household Expenses: 1996  
##  (Other)           : 6112

## 
## FALSE  TRUE 
##  8909 75759

## Source: local data frame [6 x 4]
## 
##   OriginationYear     mean_ median_     n
##             (chr)     (dbl)   (dbl) (int)
## 1            2009  4401.833    3000  1807
## 2            2010  4775.103    3900  5564
## 3            2011  6692.021    5500 11228
## 4            2012  7833.842    5000 19553
## 5            2013 10545.300   10000 34344
## 6            2014 11912.220   10000 12172

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0    31.0    68.1    97.0  1189.0
## 
##     1     2     3     4     5     6     7     8     9    10 
## 27439  1200   872   696   615   565   553   614   546   526

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    5000   10000   11100   15000   35000

Most of the loans are spent on home improvement. The amount ranges from 1000 to 35000$ ( median = 7500$ ), while most of the loans have an amount of up to 15000$ (where 500$ , 1000$ and 15000$ are most common).

We can also see a first trend that since 2009 the average amount that is borrowed increased from around 4000$ to around 10000$ in 2013. An the trend seems to continue for the beginning of 2014 (we only have data till march 2014).

Most of the loans are in status ‘current’ and ‘completed’ which is what you expect for a working platform. However, the number of loans in status current is very high (more than 60%), especially as we have data of almost 5 years and the average loan runs for 3 years, so there should be a fair amount of loans that reached the completed/defaulted status. If we add time information (reference time is about march 2014), we see that Prosper saw a tremendous increase of loans being funded through its platform in the last year (2013). And as the average loan runs for 3 years this explains why so many loans are in status ‘current’.

By far the most loans have one single investor (27440), but some have up to 1189 investors (the median is 68 investors). It looks like quite a number investors decide to invest in specific loans (providing the full amount) instead of using the tools like QuickInvest provided by the plattform. The dataset does not contain info on how the amount is split among investors. But we can see, that single investors are common among all loan amounts.

## [1] 19498    88
##  LP_CustomerPrincipalPayments LP_InterestandFees  LP_ServiceFees   
##  Min.   :   39.34             Min.   :   -0.003   Min.   :-589.95  
##  1st Qu.: 3000.00             1st Qu.:  322.480   1st Qu.: -75.08  
##  Median : 4999.26             Median :  744.490   Median : -37.72  
##  Mean   : 6448.79             Mean   : 1070.187   Mean   : -55.35  
##  3rd Qu.: 8500.00             3rd Qu.: 1494.088   3rd Qu.: -17.07  
##  Max.   :35000.00             Max.   :10013.570   Max.   :   2.87  
##  LP_CollectionFees   LP_GrossPrincipalLoss LP_NetPrincipalLoss
##  Min.   :-1115.630   Min.   :0             Min.   :0          
##  1st Qu.:    0.000   1st Qu.:0             1st Qu.:0          
##  Median :    0.000   Median :0             Median :0          
##  Mean   :   -3.126   Mean   :0             Mean   :0          
##  3rd Qu.:    0.000   3rd Qu.:0             3rd Qu.:0          
##  Max.   :    0.000   Max.   :0             Max.   :0          
##  LP_NonPrincipalRecoverypayments LP_CustomerPayments
##  Min.   :0                       Min.   :   75.86   
##  1st Qu.:0                       1st Qu.: 3524.39   
##  Median :0                       Median : 5647.26   
##  Mean   :0                       Mean   : 7518.97   
##  3rd Qu.:0                       3rd Qu.:10302.02   
##  Max.   :0                       Max.   :37369.16

## [1] 219  88
## (-1.5e+04,-1e+03]     (-1e+03,-100]        (-100,-10]          (-10,10] 
##                 1                55                67                37 
##          (10,100]       (100,1e+03]   (1e+03,1.5e+04] 
##                36                16                 7
##   [1] 15001.029  7500.596 13000.128  7500.697 15000.000  4000.000 20000.000
##   [8]  4000.272  9499.755 20002.743  6500.558 11001.395 15000.562  8000.753
##  [15] 15001.385  5000.000  8999.776 15001.258  5000.000  7500.000  5999.767
##  [22]  7499.662  7500.595  8500.716  6000.493  6000.000 10001.396 10000.000
##  [29]  6999.886 15000.000 10000.000  7000.422 10000.000  4000.555 13500.618
##  [36]  7000.236 15001.408  8000.460  7500.355 15000.000 11999.727 10000.000
##  [43]  4499.900 13001.614 12001.139  4000.679 15001.415 14500.891 21002.294
##  [50] 13500.626 15000.502  7500.312 15001.623  8400.686 15001.011  3499.939
##  [57] 10001.245 15001.001  5000.702 15000.508  5500.199  8000.455 21000.420
##  [64] 10001.369  4000.285 20000.000  4000.852 10000.000 15001.500  7000.501
##  [71] 12001.144  5000.000 20001.419 15001.390  7000.000  5000.000 10000.000
##  [78]  8501.127 17000.799 15000.577 15001.362 25001.376 25000.711 13000.446
##  [85] 10000.000  7500.300 12501.183  7500.355 15001.350  4000.211  4000.776
##  [92]  7500.764 10000.000 25002.697  9000.524  3000.000  8000.000 15001.971
##  [99] 25001.197  4000.556 12001.400 15000.000 10001.312 15002.062  4000.508
## [106] 10000.000 19501.862  4001.023  3999.729 15001.410 15000.616  9000.910
## [113] 10000.000 10000.000  4000.713 15000.681 15000.000  6000.000 20002.703
## [120] 10500.766  4000.000  6000.242  7000.672 15000.000 15000.000  4000.273
## [127] 12001.260  2999.867  4000.816 15001.257 11000.991  5000.000 12501.444
## [134] 25000.000  8000.000 15000.657 12000.955 12000.252 25003.536  5000.000
## [141] 10000.000 12001.687  7000.470  9499.868 24502.038 15001.011  4000.422
## [148]  5500.288 10500.729 15001.054 10000.000 10001.053  5499.840  5400.046
## [155] 25001.817  7500.259 15001.370 25001.679 25000.000 15001.404 15001.818
## [162] 14501.481  7500.297 15002.137  5225.047 10001.402 25001.723  4000.000
## [169] 15002.058 17001.606 10000.000 15000.000  4000.549  4001.115  7500.554
## [176] 20001.158 20000.000  7500.609 15000.000  7500.346 12000.850 15002.136
## [183] 15001.423 15001.420  5000.700  4500.068 15000.542  3999.764  8000.200
## [190]  6000.000  7500.355 15000.650 15001.395 25001.341  4000.000 10000.000
## [197] 15000.534  4000.211 12500.972  5000.000 10000.000 15002.099 10001.133
## [204]  7500.295 11501.721  3499.770 10000.000 13500.434  7999.782 10001.365
## [211]  4000.411  9600.100  4000.516 21002.607  4000.281  3000.123  4000.839
## [218] 15999.785  9000.659 12000.247 17500.552 15001.412 15500.418 25002.857
## [225] 11500.922 15002.114 13001.555 24502.195 15001.357 15002.078  3999.528
## [232] 15001.203 10499.930 15000.695  4000.000  5000.551 24000.000 10001.367
## [239] 12500.000 15001.249  4000.000  7000.000 15000.710 15000.000  5000.534
## [246] 10000.000  5500.779  8000.891 13001.470 17930.302  8500.250 10000.000
## [253]  6500.265 15001.353 15000.526 15002.128 25000.554 25002.125 11999.737

## [1]  0 88

I first want to look only at loans in status ‘completed’ to get a feel for the features LP_CustomerPayments, LP_CustomerPrincipalPayments, LP_InterestandFees, LP_ServiceFees. As these loans are in status ‘completed’ all principal, interest and fees should be payed. In the summary for this subset we see no loss or recovery payments, but some collection fees. So in between payment was late, but in the end everything was repayed. In the majority of the cases (19279 out of 19498) LoanOriginalAmount and LP_CustomerPrincipalPayments are equal (difference is smaller +/- 1). But there are a few cases (219) with partially considerable discrepancies of hundreds or tousands that were payed back too much or too few. When looking at these cases some were payed back after only a few months, but had a term of 3 or even 5 years. Maybe some amounts are not correctly accounted for? With the data available in this dataset we cannot answer this question, as we have no info on the single payments, only aggregated values. The percent funded is already incorporated in LoanOriginalAmount. This amount is not the amount originally requested by the borrower, but only the amount that was actually funded. LP_CustomerPayments = LP_CustomerPrincipalPayments + LP_InterestandFees holds for all loans, no outliers here.

##  BorrowerState                 Occupation         EmploymentStatus
##  CA     :10731   Other              :21271   Employed     :67309  
##  NY     : 5839   Professional       :10523   Full-time    : 7758  
##  TX     : 5627   Executive          : 3464   Self-employed: 4532  
##  FL     : 5401   Computer Programmer: 3220   Other        : 3806  
##  IL     : 4248   Teacher            : 2882   Not employed :  647  
##  OH     : 3367   Analyst            : 2729   Retired      :  364  
##  (Other):49455   (Other)            :40579   (Other)      :  252  
##  EmploymentStatusDuration IsBorrowerHomeowner DebtToIncomeRatio
##  Min.   :  0.0            False:39922         Min.   : 0.000   
##  1st Qu.: 30.0            True :44746         1st Qu.: 0.150   
##  Median : 74.0                                Median : 0.220   
##  Mean   :103.1                                Mean   : 0.259   
##  3rd Qu.:148.0                                3rd Qu.: 0.320   
##  Max.   :755.0                                Max.   :10.010   
##  NA's   :19                                   NA's   :7279     
##          IncomeRange    IncomeVerifiable StatedMonthlyIncome
##  $50,000-74,999:25561   False: 7318      Min.   :      0    
##  $25,000-49,999:24117   True :77350      1st Qu.:   3442    
##  $100,000+     :15189                    Median :   5000    
##  $75,000-99,999:14468                    Mean   :   5934    
##  $1-24,999     : 4644                    3rd Qu.:   7083    
##  Not employed  :  647                    Max.   :1750003    
##  (Other)       :   42
## [1] 688
## [1] 0

##  [1]  416666.7  483333.3  108333.3  150000.0  618547.8 1750002.9  466666.7
##  [8]  120833.3  158333.3  250000.0  108750.0  394400.0

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1 12185

The majority of the borrowers comes from California, followed by NY and Texas. Most of them have some kind of employment, only few indicated to be unemployed or retired (647 and 364). More than half of the borrowers are homeowners. The median stated monthly income is 5000 $. There are some outliers with monthly incomes above 100000 $. If we cut off the highest 1% of incomes we get a slightly positively skewed distribution of incomes. The stated monthly incomes are consistent with the associated income ranges. When breaking out by loan status we see that the income distributions for each loan status are pretty similar to the overall income distribution, so income does not seem to be a strong indicator for whether a loan is payed back in time or not. Note that 12185 borrowers have or had multiple loans in the time between 2009 and 2014.

##   LenderYield     EstimatedEffectiveYield EstimatedLoss    
##  Min.   :0.0300   Min.   :-0.1827         Min.   :0.00490  
##  1st Qu.:0.1259   1st Qu.: 0.1166         1st Qu.:0.04240  
##  Median :0.1775   Median : 0.1615         Median :0.07240  
##  Mean   :0.1861   Mean   : 0.1688         Mean   :0.08034  
##  3rd Qu.:0.2474   3rd Qu.: 0.2243         3rd Qu.:0.11200  
##  Max.   :0.3400   Max.   : 0.3199         Max.   :0.36600  
##                                                            
##  EstimatedReturn     ProsperScore   ProsperRating  BorrowerAPR     
##  Min.   :-0.18270   4      :12594   AA: 5338      Min.   :0.04935  
##  1st Qu.: 0.07455   6      :12271   A :14509      1st Qu.:0.16361  
##  Median : 0.09211   8      :12007   B :15566      Median :0.21945  
##  Mean   : 0.09609   7      :10581   C :18303      Mean   :0.22674  
##  3rd Qu.: 0.11660   5      : 9806   D :14248      3rd Qu.:0.29254  
##  Max.   : 0.28370   3      : 7639   E : 9781      Max.   :0.42395  
##                     (Other):19770   HR: 6923                       
##   BorrowerRate   
##  Min.   :0.0400  
##  1st Qu.:0.1359  
##  Median :0.1875  
##  Mean   :0.1961  
##  3rd Qu.:0.2574  
##  Max.   :0.3600  
## 
##  LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees   
##  Min.   :    0.0              Min.   :   -2.35   Min.   :-589.95  
##  1st Qu.:  392.4              1st Qu.:  254.08   1st Qu.: -72.33  
##  Median : 1247.3              Median :  674.95   Median : -34.90  
##  Mean   : 2624.3              Mean   : 1042.86   Mean   : -54.65  
##  3rd Qu.: 3400.0              3rd Qu.: 1436.66   3rd Qu.: -14.28  
##  Max.   :35000.0              Max.   :10572.78   Max.   :   3.01  
##  LP_CollectionFees   LP_GrossPrincipalLoss LP_NetPrincipalLoss
##  Min.   :-4865.080   Min.   :  -94.2       Min.   : -504.4    
##  1st Qu.:    0.000   1st Qu.:    0.0       1st Qu.:    0.0    
##  Median :    0.000   Median :    0.0       Median :    0.0    
##  Mean   :   -8.198   Mean   :  376.2       Mean   :  367.9    
##  3rd Qu.:    0.000   3rd Qu.:    0.0       3rd Qu.:    0.0    
##  Max.   :    0.000   Max.   :25000.0       Max.   :25000.0    
##  LP_NonPrincipalRecoverypayments LP_CustomerPayments
##  Min.   :   0.000                Min.   :   -2.35   
##  1st Qu.:   0.000                1st Qu.:  797.52   
##  Median :   0.000                Median : 2213.68   
##  Mean   :   7.639                Mean   : 3667.20   
##  3rd Qu.:   0.000                3rd Qu.: 4868.19   
##  Max.   :7780.030                Max.   :37369.16
## [1] 84628    88

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(loanData$ProsperRating) and loanData$EstimatedLoss
## t = 1061.6, df = 84666, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9639573 0.9648986
## sample estimates:
##      cor 
## 0.964431

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(loanData$ProsperScore) and loanData$EstimatedLoss
## t = -265.38, df = 84666, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6775265 -0.6701722
## sample estimates:
##       cor 
## -0.673866

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(loanData$CreditScoreRangeUpper) and loanData$EstimatedLoss
## t = -172.82, df = 84666, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5156107 -0.5056518
## sample estimates:
##        cor 
## -0.5106484

The BorrowerRate is the LenderYield plus 1% (probably service fee). Lender Yield is in the range from 3 to 34%. In 2009-11 there are loans over the full range, while in the year 2012-14 the range reduces more and more (in 2014 it is between 5% and 30%). In 2011 and 2012 we have high peaks at around 30% while in 2013/14 the distribution of lender yield becomes more balanced. Furthermore we see that the lender yield becomes more and more aligned with the Prosper Rating over the years. Prosper uses an agency score and calcualates a ProsperScore. In addition there is a ProsperRating giving Prosper’s risk assessment (risk of loss) for the respective loan. And obviously when plotting the Prosper Rating against the Estimated Loss we find a clear correlation. ProsperScore and ProsperRating seem to be correlated, as well as the ProsperRating and the agency score. When looking it up in the docu you find that the ProsperRating takes into acount the ProsperRating and the agency score together with some additional parameters. What is intersting is that in 2014 it looks like Prosper increased the minimum agency score that a borrower must have to be allowed to apply for a Prosper loan. Might be that this is also related to the more balanced distribution of lender yield.

##    12    36    60 
##  1613 58510 24545
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    8.00   13.00   16.07   23.00   41.00   58848

## 
## FALSE  TRUE 
##  2875 16623

When we look at the closed date then we see that the number of closed loans increased over time, especially in the last 2 years. Maybe some of the many newly originated loans were payed back early? So the high number of new loans also leads to an increase in loan completion?

Looks like many loans are payed back early. Loan periods are 12, 36 and 60 months. Most common is a term of 36 months, while 12 months is pretty rare. When we look at the actual difference between LoanOriginationDate and ClosedDate (actualTerm in the graphs) we see 2 peaks for the completed loans, one around 12 months and one at about 36 months, as expected. But especially in the first 12 months there is a high number of completions and also between 12 and 36.
When calculating the ratio of the indicated term and the actual difference between origination and close date we see that the majority is repayed a few or even many months before its term ends (22791 out of 25820). And that seems to be true for older as well as for newer loans.

## 
##   (0,5]  (5,10] (10,15] (15,20] (20,25] (25,30] (30,35] (35,40] (40,45] 
##    3765    5489    5513    3473    2404    1672    1238    2244      22 
## (45,50] (50,55] (55,60] 
##       0       0       0

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1493     513    1459    2219    3178   17440

## 
##       (999,4e+03]  (4e+03,5.25e+03]  (5.25e+03,1e+04]   (1e+04,1.5e+04] 
##             28072              5804             24039             17844 
## (1.5e+04,3.5e+04] 
##              8909

The distribution of the interest difference (expected (ideal) amount of interest - actual amount of interest) has a very long tail. Log-transforming it gives a more normal distribution, however with a strong tail to left. Note, that for log-transforming it I left out the negative values (where people payed more than the expected interest amount). Coloring the distribution of the log-transformed interest difference shows nicely how the the interest difference increases with shorter repayment terms and longer agreed loan term. The loan amount also seems to be a factor, on the other hand for the Prosper Rating there is not a clear picture.

Univariate Analysis

What is the structure of your dataset?

The dataset contains information about 84668 loans that are processed through the Prosper platform. The dataset containd 87 varables for each loan reaching from on - the loan itself (amount, when it was originated, how many investors,…) - the status of the loan (aggregated info on customer payments or loss, payed interest, fees and principal) - the borrower (income, emloyment status/duration, state,…) - credit score and other indicators on credit-worthiness provided by an agency - on credit-worthimess based on previous prosper loans - and a set of figures provided by prosper to rate the given loan (prosper score, lender yield, estimated loss, estimated return)

Most of the loans are spent on home improvement. The amount ranges from 1000 to 35000$ (median = 7500$ and 500\(, 1000\) and 15000$ being the most common amounts). Since 2009 the average amount that is borrowed increased from around 4000$ to around 10000$ in 2013/2014. We also see a tremendous increase of loans being funded through the Prosper platform in the last year (reference time is about march 2014).

By far the most loans have one single investor (27440), but some have up to 1189 investors (the median is 68 investors). It looks like quite a number investors decide to invest in specific loans (providing the full amount) instead of using the tools like QuickInvest provided by the plattform.

The majority of the borrowers comes from California, followed by NY and Texas. Most of them have some kind of employment, only few indicated to be unemployed or retired (647 and 364). more than half of the borrowers are homeowners. The median stated monthly income is 5000$. There are some outliers with monthly incomes above 100000 $. If we cut off the highest 1% of incomes we get a slightly positively skewed distribution of incomes.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the interest rate the lender can achieve with the different loans. In the dataset we have different variants of this feature: - borrower rate (interest rate of the borrower) - lender yield (interest rate - 1% service fee) - estimated effective yield (borrower interest rate - servicing fee rate - estimated uncollected interest on charge-offs + estimated collected late fees)

I want to look at the lender yield. This is what the lender receives in an ideal case where the borrower pays back the full principal and interest. While the effective yield is equal to the lender yield minus estimated uncollected interest on charge-offs plus estimated collected late fees.

In addition the LP values give us summary information of what the borrower actually payed attributed to principal, interest, fees (LP_CustomerPrincipalPayments, LP_InterestandFees, LP_ServiceFees). From this we can calculate the amount of interest that the lender actually gets and compare to what is the expected amount of interest in the ideal case.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Time information like LoanOriginationDate/Quarter and ClosedDate can help to see how things evolved over time. Furthrmore information on the loan itself like the amount and the term for payback seems to be of interest when determining the lender yield. And certainly an indicator for the risk that is associated with the respective loan. Here the propser rating alias estimated loss is the indicator that I want to look at. It is derived from the Prosper score and the credit score and is used by Prosper to give an indicator for risk (estimated loss) associated with a loan.

Did you create any new variables from existing variables in the dataset?

I goggled how to calculate the the monthly payment (annuity): A = Kq^nT (q-1)/(q^nT - 1) with q = i/n where K is the loan amount, T is the term (here 3 or 5 years), n number of payments per year (in our case is 12 as we have a monthly payment) and i is interest rate per year. This way the monthly payment of the borrower is calculated. In addition I calculated the interest amount that the lender could expect in the best case where the borrower pays back all the principal and interest taking the term as indicated in the listing. However, as the borrower can pay back parts or the whole loan early whenever he can afford it, it can be interesting to look at the actual interest that was payed for completed loans and compare it to the best case interst. So in addition I calculated the difference between actual and best case interest payment. In addition I created the variable actualTerm which specifies how many months it took till the borrower actually payed back his loan (LoanOriginationDate - ClosedDate)

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I converted date variables like LoanOriginationDate, CloseDate to the R Date type. I parsed the LoanOriginationQuarter parameter to be able to sort it first by year and then by quarter. Furthermore I converted some of the variables, like Term or listingCategory to factors. When I calculated the difference between CloseDate and LoanOriginationDate I found a few cases where the CloseDate < LoanOriginationDate. As this does not make sense I dropped these loans.

I found that the interest difference (expected (ideal) amount of interest - actual amount of interest) has a very long tail,so I tried a log transformation which leads to a more normal distribution of the data with a mean around 1000.

Bivariate Plots Section

When looking at the distribution of lender yield over time, we already found that the distribution becomes more balanced and more aligned to the Propser Rating. The boxplot by loan origination quarter also show how the inter-quantile range becomes smaller. Furthermore the mean lender yield decreases. When looking at the violin plot we see that the bulk of the lender yields for the loans moves more to the mean starting in 2013. Prosper loans seem to move away from the high risk loans with high lender yields.

Things seem to be pretty stable over the quarters of a year, so breaking out by year should be enough.

When looking at the lender yield vs the Estimated Loss we see a very high variance in the lender yield. This proabably due to changes in how lender yields were assigned to different risk categories over the years? But breaking out by loan origination quarter shows that till 2010 there was a high variance in how the lender yield was aligned with the associcated risk, but then we get a more and more linear relation with almost no variance in 2014.

In 2012/13 we have 3 separate ‘lines’, wondering were this comes from. It seems to be agreed on term for the loan that also influenced the lender yield (the shorter the repayment term the lower the interest a borrower has to pay). However, in 2014 this effect cannot be seen any more, so Prosper does not seem to give better interest rates for shorter loan terms, the 12-months term goes away completely in 2014.

The loan amount also seems to determine the lender yield in a way. But this seems to be more related to the Prosper Score of the borrower. With lower a Prosper Score a borrower can only apply for lower amounts. At least this pattern emerges starting in 2011 to be become pretty clear in 2014.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

When looking at the lender yield vs the Estimated Loss we see a very high variance in the lender yield. Breaking out by loan origination quarter shows that till 2010 there was a high variance in how the lender yield was aligned with the associcated risk, but then we get a more and more linear relation with almost no variance in 2014. So Prosper obviously adjusted its policy for assigning interest rates to loans according to the associated risk. While the agreed on loan term was still a factor that determined lender yield in 2012/13, this is no more the case in 2014, so Prosper seems to have decided to not give better interest rates for shorter loan terms any more.

This fits with the observation that I made with regard to the actual term/time that borrowers need to repay their loans. The majority of the loans are payed back early anyway (16623 out of 19498 completed loans) and no borrower seems to fully use their 5-year term, most do not even use their 3-year term as we have seen earlier.

When looking at the relation between interest difference (expected amount of interest (in the ideal case) - actual amount of interest) and actual repayment period we find that with shorter repayment period (actual term) the interest difference increases, but also its variance. Holding the actual term constant we see the highest interest difference for loans with a 5-year term and the lowest for loans with a 1-year term. Furthermore, holding actual term constant we also see higher interest difference for higher loan amounts again with increasing variance.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The relationship between Prosper Score and loan amount. Plotting the loan amount vs. the Estimated Loss and coloring by Prosper Score and then breaking out by origination year shows clearly how Prosper assigned different maximum loan amounts to people with different Propser Scores and how that evolved over time. In 2009 there seems to be no gradation, but till 2014 more and more clear steps can be seen. Starting in 2013 Propser increased its overall maximum loan to 35000 (25000 previously), but only for Prosper Score 7 or better. The association of lower maximum loan amounts with worse Propser Scores and higher amounts for better Prosper Scores respectively, is proabably one of the measures taken that led to the more and more balanced (normal) distribution of lender yields that we observed previously.

What was the strongest relationship you found?

The relationship between lender yield and estimated loss in the years 2012 - 2014.

Multivariate Plots Section

## 
## Calls:
## m1: lm(formula = LenderYield ~ EstimatedLoss, data = subset(loanData, 
##     OriginationYear > 2010))
## m2: lm(formula = LenderYield ~ EstimatedLoss + Term_f, data = subset(loanData, 
##     OriginationYear > 2010))
## 
## =======================================
##                     m1          m2     
## ---------------------------------------
## (Intercept)       0.059***    0.021*** 
##                  (0.000)     (0.000)   
## EstimatedLoss     1.588***    1.606*** 
##                  (0.001)     (0.001)   
## Term_f: 36/12                 0.032*** 
##                              (0.000)   
## Term_f: 60/12                 0.047*** 
##                              (0.000)   
## ---------------------------------------
## R-squared             0.938       0.953
## adj. R-squared        0.938       0.953
## sigma                 0.018       0.016
## F               1164623.865  516716.727
## p                     0.000       0.000
## Log-likelihood   201011.893  211461.613
## Deviance             24.941      19.032
## AIC             -402017.786 -422913.226
## BIC             -401990.020 -422866.949
## N                 77297       77297    
## =======================================

## 
## Calls:
## m1: lm(formula = I(sqrt(sub1$interest_diff/sub1$MonthlyAmount)) ~ 
##     ProsperRating - 1, data = sub1)
## m2: lm(formula = I(sqrt(sub1$interest_diff/sub1$MonthlyAmount)) ~ 
##     ProsperRating + Term_f - 1, data = sub1)
## m3: lm(formula = I(sqrt(sub1$interest_diff/sub1$MonthlyAmount)) ~ 
##     ProsperRating + Term_f + actualTerm - 1, data = sub1)
## 
## ===================================================
##                        m1         m2         m3    
## ---------------------------------------------------
## ProsperRating: AA   1.323***  -0.049      0.342*** 
##                    (0.032)    (0.032)    (0.012)   
## ProsperRating: A    1.663***   0.129***   0.585*** 
##                    (0.023)    (0.029)    (0.011)   
## ProsperRating: B    2.185***   0.453***   0.888*** 
##                    (0.023)    (0.030)    (0.011)   
## ProsperRating: C    2.550***   0.741***   1.145*** 
##                    (0.022)    (0.030)    (0.011)   
## ProsperRating: D    2.433***   0.877***   1.386*** 
##                    (0.018)    (0.029)    (0.011)   
## ProsperRating: E    2.559***   1.119***   1.569*** 
##                    (0.024)    (0.030)    (0.011)   
## ProsperRating: HR   2.473***   1.215***   1.628*** 
##                    (0.024)    (0.031)    (0.012)   
## Term_f: 36/12                  1.258***   1.867*** 
##                               (0.028)    (0.011)   
## Term_f: 60/12                  3.430***   3.765*** 
##                               (0.030)    (0.011)   
## actualTerm                               -0.080*** 
##                                          (0.000)   
## ---------------------------------------------------
## R-squared               0.820      0.930      0.990
## adj. R-squared          0.820      0.930      0.990
## sigma                   1.069      0.665      0.250
## F                    9900.795  22535.438 153300.805
## p                       0.000      0.000      0.000
## Log-likelihood     -22623.597 -15405.452   -475.269
## Deviance            17390.556   6741.139    949.317
## AIC                 45263.193  30830.904    972.538
## BIC                 45324.243  30907.216   1056.482
## N                   15233      15233      15233    
## ===================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I found that looking at the interest difference relative to the Monthly Payment for the loan reduces the variance tremendously which makes sense as the interest difference depends on the loan’s amount which is related to the Monthly Rate the borrower has to pay. I looked at difference in interest amount per Monthly Payment vs. actual repayment period, as I expect the actual payment period to be a determining factor for the interest difference. Holding the actual repayment period constant we always get a higher interest difference for a longer agreed on term (1-year vs 3-year vs 5-year). This is proabably because the expected amount of interest depends on the agreed on term.

Were there any interesting or surprising interactions between features?

Holding the actual repayment period constant we get 3 groups. The 3 groups seem to perfectly align with the agreed on terms. That means holding the actual repayment period and the agreed on term constant we always have a a higher interest difference for worse propser rating. Maybe because of the higher interest rate the borrower has to pay when getting a worse rating. I first log-transformed the interest difference per Monthly Payment, but then found using a sqrt-transformation shows a more linear relationship. So the last 2 plots suggest that we can use the feature variables to build a linear model for interest difference.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I tried buildng a linear model for Lender Yield and for interest difference. The model for Lender Yield is pretty simple I just used the features Estimated Loss and Term and a get a model that according to the r squared accounts for ~95% of the variation which is pretty good. When when looking at the residuals these are are reasonably well normally distributed in the Lender Yield range from 10-30%, but not outside this range as also the normal QQ-plot shows.

Regarding the model for interest difference, I looked at interest difference per Monthly Payment and used a sqrt-transformation. As features I used Term, Prosper Rating and actual repayment period. The r square suggests that Prosper Rating and Term account for 93 % of the variation. Adding actual repayment period increases this to 95%. But I was surprised that actual repayment period only added so “little” or better but that Prosper Rating, Term and Monthly Payment explain so much of the variance. This might be because the actual repayment period might already depend on the other features like Propsper Rating or Monthly Payment. When we look at the ggpairs plot I did previously we can see that it is correlated with the loan’s original amount (-0.114) and with Lender Yield (-0.412). And we know from previous plots that there is a strong correlation between Lender Yield, Estimated Loss and Prosper Rating and it is also obvious that Monthly Payment and the loan’s original amount are not independant. So we should leave the actual repayment period out. Furthermore this would make the model more valuable as this feature is not in the dataset and you don’t know the actual repayment period in advance. So it would be possible to predict the interest difference only with available features.
However, looking at the residuals these are not normally distributed as the QQ plot suggests. So further investigation is necessary to find additional features that lead to the systematic error.
——

Final Plots and Summary

Plot One

Description One

The plot shows how Prosper aligs the Lender Yield and the risk associated with a loan more stricly over the years. As an example in the 2009-2011 timeframe there exist loans rated as ‘High-Risk’ which were assigned a Lender Yield of as low as 16.8% and up to 34% (median was already 30%), while in 2012-2014 the range reduced to only 27.7-30.7%. This is also reflected in the correlation. In the years 2009-2011 lender there already exists a strong correlation between Estimated Loss and Lender Yield (0.89), however it becomes almost perfect in the years 2012-2014 (0.97). Furthermore one can oberserve how the maximum amount that a borrower can apply for is more and more adjusted to the loan’s associated risk. While in 2009-2010 there is no clear gradation, in 2011 only loans with a Prosper Rating of AA or A have a loan amount of up to $25000. In 2012 this is extended to the B and C rated loans. Interestingly, in 2013 the maximum amount was increased to $35000 but only for AA - B rated loans, while C rated loans stayed at a maximum of $25000, Also the lower ratings have a specific maximum amount (D rated loans up to $15000, E rated loans 10000$ and HR rated only $4000).

Plot Two

Description Two

The plot shows how lender yield and estimated loss (the risk associated with a loan) become more aligned over the years. In 2009 we see a great variance in the lender yield which even increases for loans with a higher estimated loss. The standard deviation grouped by Prosper Rating ranges from 2.3% for AA rated loans to up to 6.5% for HR rated loans. The variance in the lender yield reduces over the years, until we have an almost perfect linear relationship in 2014. In the years 2013/2014 the standard deviation of the lender yield grouped by Prosper Rating fluctuates only between 0.66% and 1.8%. So Prosper obviously adjusted its policy for assigning interest rates to loans according to the associated risk.

Furthermore we can see how the repayment term affects the lender yield that is assigned to a loan. Keeping the Estimated Loss constant we find that loans with shorter repayment term are assigned lower lender yields. This can be seen very clearly for the years 2011, 2012 and 2013. However, in 2014 the repayment term looses its effect on the lender yield. So Prosper seems to have decided to not give better interest rates for shorter loan terms any more. This fits with the observation that the majority of the loans are payed back early anyway (16623 out of 19498 completed loans).

Plot Three

Description Three

This plot further investigates the effect that early repayment of a loan has on the amount of interest that a lender can expect to achieve. The plot shows the factors that primarily determine the interest loss a lender can expect not to achieve due to early repayment of a loan. We look at the the amount of interest relative to the loan’s amount as obviously the amount of interest depends on the borrowed amount. Further applying a Squareroot-transformation shows a negative linear relationship to the actual repayment period. Holding the actual repayment period constant we get 3 groups. The 3 groups seem to perfectly align with the agreed on terms. That means holding the actual repayment period and the agreed on term constant we always have a a higher interest difference for worse propser rating. Probably because of the higher interest rate the borrower has to pay when getting a worse rating.


Reflection

Given that there are so many features in this dataset I started looking at features more or less randomly. As I have no background in finance I did not really know what to look at. In the beginning I struggled with understanding the meaning of many of the variables, but then found that plotting often helped in understanding their meaning and relation(in addition to reading documenation). The dataset contains many variables which are very closely related and to understand all the subtle differences between the different features was not always easy. For some the variables I found in the course of the investigation that another representation (e.g. transform to categorical variable) or reducing the number of categories helped (especially for time related variables, e.g. years instead of quarters/months) when plotting the data. Otherwise one cannot see anything on the plots as they are so overloaded with detail. Then I observed how the number of loans handled by the platform increased over time and that many loans are payed back early. Then I found that lender yield/borrower rate became more and more aligned with the associcated risk over the years and that at the same time that the distribution of the loans with regard to Propser Rating became more balanced, that means more normal. Previously one could observe many more high-risk loans than lower risk loans. So I looked at the factors that determine the Lender Yield more closely and it turned out that in the associated risk really is the determining factor. Furthermore I found it interesting to further investiagte how the early repayment affects the amount of interest that a lender can expect. Sure he gets his money back earlier and can reinvest it, but still it might be of interest as in some cases the money was payed back only after 1 to 3 months, which results in more effort for the lender when he has to take care of re-investing. The data contains loan data till the beginning of 2014, so it might be interesting to look at newer loan data to verify the findings. To further investiagte the interest difference it would be interesting to get more details on when which repayments occured, in this dataset we only get the aggregated amount of borrower payments.